Explore different perspectives and approaches to create more truthful and enlightening data visualisation
Anticipating rapid growth, the city of Engagement, Ohio USA is doing a participatory urban planning exercise to understand the current state of the city and identify opportunities for future growth. About 1000 representative residents in this modest-sized city have agreed to provide data using the city’s urban planning app, which records the places they visit, their spending, and their purchases, among other things. From these volunteers, the city will have data to assist with their major community revitalization efforts, including how to allocate a very large city renewal grant they have recently received.
Economic considers the financial health of the city. How does the financial health of the residents change over the period covered by the dataset? How do wages compare to the overall cost of living in Engagement? Are there groups that appear to exhibit similar patterns?
In this exercise, we will explore different perspectives and approaches to create more enlightening data visualisation on dataset VAST Challenge 2022. The operation was carried out on Rstudio and main packages used are tidyverse and ggplot2 extensions.
Before we get started, it is important for us to ensure that the required R packages have been installed. If yes, we will load the R packages. If they have yet to be installed, we will install the R packages and load them onto R environment.
The chunk code on the right will do the trick.
packages = c('tidyverse', 'knitr', 'ggdist', 'ggridges',
'scales', 'grid', 'gridExtra','plotly',
'ggrepel', 'formattable', 'patchwork',
'ggiraph', 'lubridate', 'data.table',
'ggthemes','gganimate','gifski','gapminder')
for(p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
The code chunk below imports participants.csv and FinancialJournal.csv into R environment using read_csv() function of readr package.
participants <- read_csv('data/Participants.csv')
financial <- read_csv('data/FinancialJournal.csv')
It is always a good practice to examine the imported data frame before further analysis is performed.
For example, kable() can be used to review the structure of the imported data frame.
Let’s take an overview of the datasets.
kable(head(participants))
| participantId | householdSize | haveKids | age | educationLevel | interestGroup | joviality |
|---|---|---|---|---|---|---|
| 0 | 3 | TRUE | 36 | HighSchoolOrCollege | H | 0.0016267 |
| 1 | 3 | TRUE | 25 | HighSchoolOrCollege | B | 0.3280865 |
| 2 | 3 | TRUE | 35 | HighSchoolOrCollege | A | 0.3934696 |
| 3 | 3 | TRUE | 21 | HighSchoolOrCollege | I | 0.1380634 |
| 4 | 3 | TRUE | 43 | Bachelors | H | 0.8573967 |
| 5 | 3 | TRUE | 32 | HighSchoolOrCollege | D | 0.7729578 |
kable(head(financial))
| participantId | timestamp | category | amount |
|---|---|---|---|
| 0 | 2022-03-01 | Wage | 2472.50756 |
| 0 | 2022-03-01 | Shelter | -554.98862 |
| 0 | 2022-03-01 | Education | -38.00538 |
| 1 | 2022-03-01 | Wage | 2046.56221 |
| 1 | 2022-03-01 | Shelter | -554.98862 |
| 1 | 2022-03-01 | Education | -38.00538 |
In order to understand the financial health of the residents change over the period, we need to derive income, overall cost and balance of residents in a monthly basis.
Monthly income/cost for residents need to be derived to view the change over recorded 15 months. Code chunk below shows how we change time format to monthly basis.
Switch to monthly basis
# A tibble: 1,856,330 x 4
participantId category amount yearmonth
<dbl> <chr> <dbl> <chr>
1 0 Wage 2473. 2022.03
2 0 Shelter -555. 2022.03
3 0 Education -38.0 2022.03
4 1 Wage 2047. 2022.03
5 1 Shelter -555. 2022.03
6 1 Education -38.0 2022.03
7 2 Wage 2437. 2022.03
8 2 Shelter -557. 2022.03
9 2 Education -12.8 2022.03
10 3 Wage 2367. 2022.03
# ... with 1,856,320 more rows
Convert 15 months into serial numbers
mon_convert <- function(y, m){mon = 12*(y-2022)+m-2}
print(mon_convert(2023,5))
[1] 15
Convert time format to year and month
Convert 15 months into serial numbers
MonthlyFinancial <- year_month %>%
mutate(SerialMonth = mon_convert(year_month$y, year_month$m))
summary(MonthlyFinancial$SerialMonth)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 4.000 8.000 7.732 11.000 15.000
Firstly, we need to use group_by to group individual expense and income category. And then use summarise function to summarize each category.
summarizedFinancial <- MonthlyFinancial %>%
group_by(participantId, category, SerialMonth, yearmonth) %>%
summarise(monthly_financial = sum(amount))
summarizedFinancial
# A tibble: 55,498 x 5
# Groups: participantId, category, SerialMonth [55,498]
participantId category SerialMonth yearmonth monthly_financial
<dbl> <chr> <dbl> <chr> <dbl>
1 0 Education 1 2022.03 -76.0
2 0 Education 2 2022.04 -38.0
3 0 Education 3 2022.05 -38.0
4 0 Education 4 2022.06 -38.0
5 0 Education 5 2022.07 -38.0
6 0 Education 6 2022.08 -38.0
7 0 Education 7 2022.09 -38.0
8 0 Education 8 2022.10 -38.0
9 0 Education 9 2022.11 -38.0
10 0 Education 10 2022.12 -38.0
# ... with 55,488 more rows
Then, the dataframe need to be pivoted using code chunk below.
Financial <- summarizedFinancial %>%
pivot_wider(names_from = category, values_from = monthly_financial)
Financial[is.na(Financial)] = 0
Financial
# A tibble: 13,331 x 9
# Groups: participantId, SerialMonth [13,331]
participantId SerialMonth yearmonth Education Food Recreation
<dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 0 1 2022.03 -76.0 -268. -349.
2 0 2 2022.04 -38.0 -266. -219.
3 0 3 2022.05 -38.0 -265. -383.
4 0 4 2022.06 -38.0 -257. -466.
5 0 5 2022.07 -38.0 -270. -1070.
6 0 6 2022.08 -38.0 -262. -314.
7 0 7 2022.09 -38.0 -256. -295.
8 0 8 2022.10 -38.0 -267. -25.0
9 0 9 2022.11 -38.0 -261. -377.
10 0 10 2022.12 -38.0 -266. -357.
# ... with 13,321 more rows, and 3 more variables: Shelter <dbl>,
# Wage <dbl>, RentAdjustment <dbl>
To show the change of financial situation of residents during this 15 months,we need to calculate the monthly income , monthly living cost as well as monthly balance and then used them to visualize in the next part.
Residents’ monthly income is derived by calculating the sum of wage.
Cost of living is made up of expenses from education, food, recreation, shelter, and offset the rent adjustment. Residents’ monthly cost of living is derived by calculating the sum of above expenses.
FINANCIAL <- Financial %>%
mutate(monthly_cost = Education + Food + Recreation
+ Shelter + RentAdjustment) %>%
mutate(monthly_income = Wage) %>%
mutate(monthly_balance = monthly_income + monthly_cost)
FINANCIAL
# A tibble: 13,331 x 12
# Groups: participantId, SerialMonth [13,331]
participantId SerialMonth yearmonth Education Food Recreation
<dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 0 1 2022.03 -76.0 -268. -349.
2 0 2 2022.04 -38.0 -266. -219.
3 0 3 2022.05 -38.0 -265. -383.
4 0 4 2022.06 -38.0 -257. -466.
5 0 5 2022.07 -38.0 -270. -1070.
6 0 6 2022.08 -38.0 -262. -314.
7 0 7 2022.09 -38.0 -256. -295.
8 0 8 2022.10 -38.0 -267. -25.0
9 0 9 2022.11 -38.0 -261. -377.
10 0 10 2022.12 -38.0 -266. -357.
# ... with 13,321 more rows, and 6 more variables: Shelter <dbl>,
# Wage <dbl>, RentAdjustment <dbl>, monthly_cost <dbl>,
# monthly_income <dbl>, monthly_balance <dbl>
In order to show the income and consumption patterns in different groups, we will combine FINANCIAL and participants dataframe together.
combine <- FINANCIAL %>%
left_join(participants, by = "participantId")
combine
# A tibble: 13,331 x 18
# Groups: participantId, SerialMonth [13,331]
participantId SerialMonth yearmonth Education Food Recreation
<dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 0 1 2022.03 -76.0 -268. -349.
2 0 2 2022.04 -38.0 -266. -219.
3 0 3 2022.05 -38.0 -265. -383.
4 0 4 2022.06 -38.0 -257. -466.
5 0 5 2022.07 -38.0 -270. -1070.
6 0 6 2022.08 -38.0 -262. -314.
7 0 7 2022.09 -38.0 -256. -295.
8 0 8 2022.10 -38.0 -267. -25.0
9 0 9 2022.11 -38.0 -261. -377.
10 0 10 2022.12 -38.0 -266. -357.
# ... with 13,321 more rows, and 12 more variables: Shelter <dbl>,
# Wage <dbl>, RentAdjustment <dbl>, monthly_cost <dbl>,
# monthly_income <dbl>, monthly_balance <dbl>, householdSize <dbl>,
# haveKids <lgl>, age <dbl>, educationLevel <chr>,
# interestGroup <chr>, joviality <dbl>
To visualise the financial change during this 15 months, we will use ridge plot to show the economic situation
p1 <- ggplot(combine,
aes(x=Wage,
y=combine$yearmonth,
fill = factor(stat(quantile)))) +
stat_density_ridges(geom = "density_ridges_gradient",
calc_ecdf = TRUE,
quantiles = 4,
quantile_lines = TRUE) +
scale_fill_viridis_d(name = "Quartiles") +
labs(x= "Wage",
y= "Time",
title="Distribution of Residents' Wage")
p1
According to above ridge plot, we can know that residents’ wage in 2022 March is higher than the following 14 months.
p2 <- ggplot(combine,
aes(x = combine$monthly_balance,
y = combine$yearmonth))+
geom_density_ridges(jittered_points = TRUE,
position = position_points_jitter(width = 0.05,
height = 0),
point_shape = '|',
point_size = 3,
point_alpha = 1,
alpha = 0.7,) +
stat_density_ridges(geom = "density_ridges_gradient", calc_ecdf = TRUE) +
geom_density_ridges_gradient(scale = 2, rel_min_height = 0.01) +
scale_fill_viridis_c(name = "Monthly Balance", direction = -1) +
geom_vline(aes(xintercept=mean(combine$monthly_balance, na.rm=T)),
color="red",
linetype="dashed",
size=0.5) +
geom_vline(aes(xintercept=median(combine$monthly_balance, na.rm=T)),
color="blue",
linetype="dashed",
size=0.5) +
theme(axis.title.y=element_text(angle=0),
axis.line = element_line(color='grey'),
plot.title = element_text(hjust = 0.5),
axis.title.y.left = element_text(vjust = 0.5,),
axis.text = element_text(face="bold")) +
labs(x= "Monthly Balance",
y= "Time",
title="Distribution of Residents' Monthly Balance")
p2
Accordingly, the distribution of monthly balance is in a similar pattern. And the dot plot delow the ridge plot shows that people with higher balance is affected even more serious.
Residents’ wages are more widely distributed in the early stage and perform in a similar pattern after the 1st month.
Distribution of monthly balance versus education level
First, Let’s have a general overview of the distribution of monthly balance with different education level.
p <- ggplot(data = combine,
aes(x=monthly_balance,
color = educationLevel)) +
geom_density() +
labs(x= "Monthly Balance",
y= "Density",
title="Density distribution of Residents' Monthly Balance",
subtitle= 'Demographics in Engagement, Ohio',
caption = "Source: VAST Challenge 2022") +
theme(panel.background = element_blank(),
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
axis.title.y = element_text(angle=0,vjust = 0.5),
axis.ticks.x = element_blank(),
axis.line= element_line(color= 'grey'),
axis.text.x = element_text(size=8,angle=0),
panel.grid.major.y = element_line(color= 'grey', size = 0.1),
plot.caption = element_text(hjust=0),
legend.key = element_rect(fill= NA))
ggplotly(p)
The density plot shows that there are more people from higher the education level for the higher monthly balance.
Below, ggdist package is used to plot raincloud plots so as to show more details of the distribution of monthly balance in diffrent education level.
ggplot(combine, aes(x = educationLevel,
y = monthly_balance)) +
scale_y_continuous(breaks = seq(0, 18000, 3000),
limits = c(0, 18000)) +
stat_halfeye(adjust = 0.5,
width = .66,
color = NA,
justification = -.01,
position = position_nudge(x = .15)) +
geom_boxplot(width = .20,
outlier.shape = NA) +
stat_summary(geom = "point",
fun.y = "mean",
colour = "red",
size = 1) +
stat_dots(side = "left",
justification = 1.3,
binwidth = .25,
dotsize = 5) +
scale_color_manual(name= 'Statistics',
values = (Mean= '#f6546a')) +
labs(title = 'Monthly Balance in Groups with Diffrent Education Levels',
x = 'Education Level',
y = 'Monthly Balance',
subtitle= 'Demographics in Engagement, Ohio',
caption = "Source: VAST Challenge 2022") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))+
coord_flip()
A combination of raincloud plot, boxplot and dot plot are used here. It further clarify our belief that education level means a lot to the wealth accumulation.
Distribution of monthly income versus cost in different education levels
The code chunk below shows the change of monthly cost versus monthly income during this 15 months in an iteratively way.
ggplot(combine, aes(x = monthly_income,
y = abs(monthly_cost),
size = Wage,
colour = combine$educationLevel)) +
geom_point(alpha = 0.5,
show.legend = TRUE) +
labs(title = 'Month: {frame_time}',
x = 'Monthly Cost',
y = 'Monthly Income') +
transition_time(as.integer(SerialMonth)) +
ease_aes('linear')
In the beginning, the change of income and cost changed quickly, then it leveled off.
Distribution of Monthly Income of Residents from Diffrent Interest Group
tooltip <- function(y, ymax, accuracy = .01) {
mean <- scales::number(y, accuracy = accuracy)
sem <- scales::number(ymax - y, accuracy = accuracy)
paste("Mean Income per Month:", mean, "+/-", sem)
}
p <- ggplot(data=combine,
aes(x = interestGroup),) +
stat_summary(aes(y = monthly_income,
tooltip = after_stat(tooltip(y, ymax))),
fun.data = "mean_se",
geom = GeomInteractiveCol,
fill = "light blue") +
stat_summary(aes(y = monthly_income),
fun.data = mean_se,
geom = "errorbar", width = 0.2, size = 0.2) +
labs(title = 'Monthly Income of Residents from Diffrent Interest Group',
x = 'Interest Group',
y = 'Monthly Income',
subtitle= 'Demographics in Engagement, Ohio',
caption = "Source: VAST Challenge 2022") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
girafe(ggobj = p,
width_svg = 8,
height_svg = 8*0.618)
Table above shows that the average wage of residents having interest of D is the highest and that of F is the lowest.
Distribution of Monthly Cost of Residents from Diffrent Interest Group
p <- ggplot(combine, aes(x = interestGroup,
y = abs(monthly_cost))) +
scale_y_continuous(breaks = seq(0, 5000, 1000),
limits = c(0, 5000)) +
stat_halfeye(adjust = 0.5,
width = .66,
color = NA,
justification = -.01,
position = position_nudge(x = .15)) +
geom_boxplot(width = .20,
outlier.shape = NA) +
stat_summary(geom = "point",
fun.y = "mean",
colour = "red",
size = 1) +
stat_dots(side = "left",
justification = 1.3,
binwidth = .25,
dotsize = 5) +
scale_color_manual(name= 'Statistics',
values = (Mean= '#f6546a')) +
labs(title = 'Monthly Cost of Residents from Diffrent Interest Group',
x = 'Interest Group',
y = 'Monthly Cost',
subtitle= 'Demographics in Engagement, Ohio',
caption = "Source: VAST Challenge 2022") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))+
coord_flip()
p
ggplotly(p)
While the mean income of those from interest group F is the lowest, it’s atotally a different case for the mean of the cost. The average cost of those from interest group F is the highest.
Interactivity is important in visual analysis, but it should be used appropriately, focused on the main point, and not obscured the core meaning of data for the sake of animation.
In visual analysis, we should try a variety of methods from a variety of perspectives, so as to choose the most suitable.